There is no r in lasso, but plenty of lasso in R!

Niels Richard Hansen
16/02-2016

Lasso

The lasso acronym means

least absolute shrinkage and selection operator

and was introduced by Robert Tibshirani in

Regression Shrinkage and Selection via the Lasso

published in the first issue of Journal of the Royal Statistical Society: Series B, 1996.

Pronunciation

You say las-SOO

and I say LAs-so.

From Spanish lazo.

Setup

We consider linear regression models of a response \( Y \) on a vector \( X = (X_1, \ldots, X_p) \) of predictors: \[ E(Y) = \mathbf{X}^T \beta= \sum_{j=1}^p X_j \beta_j. \]

With \( n \) observations we organize \( Y_1, \ldots, Y_n \) in a vector \( \mathbf{Y} \) and the predictor vectors \( X_1, \ldots, X_n \) as the rows of an \( n \times p \) matrix \( \mathbf{X} \).

The least squares estimator is \[ \hat{\beta} = \mathop{\arg \min}\limits_{\beta} \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 = \mathop{\arg \min}\limits_{\beta} \sum_{i=1}^n (Y_i - X_i^T \beta)^2, \] where \( \|\mathbf{z}\|_2^2 = \sum_{i=1}^n z_i^2 \) is the squared Euclidean norm.
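As a small base-R sketch (not part of the original slides, on made-up simulated data), the least squares estimator can be computed directly from the normal equations \( \mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{Y} \) and compared against lm:

```r
## Least squares via the normal equations on simulated data.
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% c(1, -2, 0.5) + rnorm(n))

## Solve X^T X beta = X^T Y
betahat <- solve(crossprod(X), crossprod(X, Y))

## Agrees with lm() without intercept, up to numerical tolerance
all.equal(drop(betahat), unname(coef(lm(Y ~ X - 1))))
```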

Prostate cancer data example

  • lpsa: log(prostate specific antigen)
  • lcavol: log(cancer volume)
  • lweight: log(prostate weight)
  • age: age of patient
  • lbph: log(benign prostatic hyperplasia amount)
  • svi: seminal vesicle invasion
  • lcp: log(capsular penetration)
  • gleason: gleason score
  • pgg45: percent gleason scores 4 and 5

Prostate cancer data example

The response, \( Y \), will be lpsa. The remaining 8 variables will be the predictors in \( X \).

     lpsa  lcavol lweight age   lbph svi    lcp gleason pgg45
1 -0.4308 -0.5798   2.769  50 -1.386   0 -1.386       6     0
2 -0.1625 -0.9943   3.320  58 -1.386   0 -1.386       6     0
3 -0.1625 -0.5108   2.691  74 -1.386   0 -1.386       7    20
4 -0.1625 -1.2040   3.283  58 -1.386   0 -1.386       6     0

All variables are standardized upfront.

prostate <- scale(prostate, TRUE, TRUE)

Spearman correlations

plot of chunk spearman

A linear model

## Intercept removed due to standardization
prostateLm <- lm(lpsa ~ . - 1, 
                 data = as.data.frame(prostate))
summary(prostateLm)
...
Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
lcavol    0.5762     0.0892    6.46  5.4e-09 ***
lweight   0.2309     0.0741    3.11   0.0025 ** 
age      -0.1370     0.0711   -1.93   0.0571 .  
lbph      0.1216     0.0724    1.68   0.0966 .  
svi       0.2732     0.0860    3.18   0.0021 ** 
lcp      -0.1285     0.1082   -1.19   0.2385    
gleason   0.0308     0.0966    0.32   0.7507    
pgg45     0.1089     0.1061    1.03   0.3072    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.603 on 89 degrees of freedom
Multiple R-squared:  0.663, Adjusted R-squared:  0.633 
F-statistic: 21.9 on 8 and 89 DF,  p-value: <2e-16


A lasso fit using glmnet

prostateLasso <- glmnet(x = prostate, y = prostate[, "lpsa"], 
                        exclude = 1, intercept = FALSE)
plot(prostateLasso, label = TRUE, lwd = 2)

plot of chunk prostateLasso

A lasso fit from Tibshirani's paper

Figure from Tibshirani's paper

What is lasso?

Lasso is a regression technique that gives a family of coefficients \( ({}^{s \!}\beta)_{s \geq 0} \) indexed by a tuning parameter \( s \geq 0 \).

  • For \( s \nearrow s_{\mathrm{max}} \) it holds that \( {}^{s \!}\beta \to \hat{\beta} \) (a least squares estimate).
  • For \( s < s_{\mathrm{max}} \) the estimate \( {}^{s \!}\beta_i \) is shrunk toward 0 compared to \( \hat{\beta}_i \).
  • For small enough \( s \) some coefficients are shrunk all the way to 0. This gives lasso the selection property.

L1-constrained regression

snapshot of an interactive 3d plot of the L1 ball

\[ {}^{1 \!}\beta = \mathop{\arg \min}\limits_{\beta: \|\beta\|_1 \leq 1} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2, \]

where \[ \|\beta\|_1 = \sum_{i=1}^p |\beta_i| \]

L1-constrained regression

snapshot of an interactive 3d plot of the cut L1 ball

\[ {}^{s \!}\beta = \mathop{\arg \min}\limits_{\beta: \|\beta\|_1 \leq s} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2. \]

Lasso defines a family of estimates.

The penalized version of lasso

\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \frac{1}{2}\| \mathbf{Y} - \mathbf{X} \beta \|^2_2 + \lambda \|\beta\|_1. \]

The monotonically decreasing, data-dependent map

\[ s(\lambda) = \|\beta^{\lambda}\|_1 \]
from \( (0,\infty) \) to \( (0, s_{\mathrm{max}}) \) gives the relation

\[ \beta^{\lambda} = {}^{s(\lambda) \!}\beta \]
between the constrained lasso and the penalized lasso.
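A small base-R illustration (not in the original slides) of the monotonicity of \( s(\lambda) \), using the orthogonal case where \( \beta^{\lambda} \) has the soft thresholding closed form given on the next slide; the least squares estimates here are made up:

```r
## Orthogonal case: beta^lambda_i = sign(betahat_i) * max(|betahat_i| - lambda, 0),
## so s(lambda) = ||beta^lambda||_1 has a simple closed form.
betahat <- c(2, -1, 0.5)  # made-up least squares estimates
s_of_lambda <- function(lambda) sum(pmax(abs(betahat) - lambda, 0))

sapply(seq(0, 2.5, 0.5), s_of_lambda)
## 3.5 2.0 1.0 0.5 0.0 0.0  -- decreasing in lambda
```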

Soft thresholding

If \( \mathbf{X} \) is orthogonal, i.e. \( \mathbf{X}^T \mathbf{X} = \mathbf{I}_p \), then \[ \beta_i^{\lambda} = \mathrm{sign}(\hat{\beta}_i) \max\{|\hat{\beta}_i| - \lambda, 0\} \] is soft thresholding of the least squares estimator.

This can be compared to hard thresholding \[ \hat{\beta}_i 1(|\hat{\beta}_i| > \lambda) \]

and linear shrinkage \[ \frac{\hat{\beta}_i}{1 + \lambda} \] related to ridge regression.
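The three shrinkage rules above are easy to write as base R functions; this is an illustrative sketch with made-up coefficient values:

```r
## Soft thresholding, hard thresholding and linear shrinkage
## applied coordinatewise to a vector of estimates.
soft   <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
hard   <- function(b, lambda) b * (abs(b) > lambda)
linear <- function(b, lambda) b / (1 + lambda)

b <- c(-2, -0.5, 0.8, 3)
soft(b, 1)    # -1.0  0.00 0.0 2.0
hard(b, 1)    # -2.0  0.00 0.0 3.0
linear(b, 1)  # -1.0 -0.25 0.4 1.5
```

Note how soft and hard thresholding both set small coefficients exactly to 0, while linear shrinkage never does.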

Soft thresholding (\(\lambda = 1\))

plot of chunk threshold

Soft thresholding (\(\lambda = 1.5\))

plot of chunk threshold2

Inspiration and related early work

Computing the lasso solution

For fixed \( s \), the lasso is the solution of a quadratic optimization problem with \( 2^p \) linear inequality constraints.

Tibshirani proposed an iterative algorithm for adding active constraints. It requires the sequential solution of quadratic optimization problems with linear inequality constraints.

Tibshirani implemented public domain functions for S-PLUS.

To obtain them, use file transfer protocol to lib.stat.cmu.edu and retrieve the file S/lasso, or send an electronic mail message to statlib@lib.stat.cmu.edu with the message send lasso from S.

The LARS algorithm

Homotopy methods as in On the LASSO and its Dual, Osborne, Presnell and Turlach, 2000, JCGS, compute the entire solution path \[ \lambda \mapsto \beta^{\lambda}. \]

Least angle regression (LAR) and its lasso modification (LARS), developed by Efron et al., provide a fast homotopy algorithm.

prostateLars <- lars(x = prostate[, -1], y = prostate[, "lpsa"], 
                     intercept = FALSE, normalize = FALSE)

This is a pure R implementation!

The LARS fit

plot(prostateLars)

plot of chunk larsPlot

The LARS fit

coefficients(prostateLars)
      lcavol lweight     age   lbph   svi    lcp gleason   pgg45
 [1,]  0.000   0.000  0.0000 0.0000 0.000  0.000  0.0000 0.00000
 [2,]  0.365   0.000  0.0000 0.0000 0.000  0.000  0.0000 0.00000
 [3,]  0.400   0.000  0.0000 0.0000 0.035  0.000  0.0000 0.00000
 [4,]  0.483   0.149  0.0000 0.0000 0.158  0.000  0.0000 0.00000
 [5,]  0.487   0.161  0.0000 0.0000 0.166  0.000  0.0000 0.00917
 [6,]  0.505   0.182  0.0000 0.0443 0.199  0.000  0.0000 0.03388
 [7,]  0.517   0.202 -0.0519 0.0763 0.211  0.000  0.0000 0.05595
 [8,]  0.522   0.213 -0.0812 0.0937 0.219  0.000  0.0107 0.06064
 [9,]  0.576   0.231 -0.1370 0.1216 0.273 -0.128  0.0308 0.10891

Elements of Statistical Learning (ESL)


History

  • ESL first edition was published in 2001.
  • The lars package (a path algorithm) was available from CRAN in May 2003.
  • I read ESL in 2004 and gave a course at UCPH.
  • Least Angle Regression with the LARS algorithm was published in 2004.
  • The glmnet package (cyclic coordinatewise descent) was available from CRAN in June 2008.
  • ESL second edition was published in 2009 introducing me to elastic net and the glmnet package.
  • An Introduction to Statistical Learning with applications in R by James, Witten, Hastie and Tibshirani was published in 2013.

Coordinate descent

The coordinate descent algorithm solves the lasso optimization problem very efficiently.

  • Wenjiang Fu proposed in Penalized regressions: the bridge versus the lasso (1998) his “shooting algorithm”.
  • The use of coordinate descent was neglected for a period.
  • Friedman was external examiner in 2006 on a PhD by Anita van der Kooij, who used coordinate descent, and the algorithm was taken up again.
  • Hastie says that the efficiency of glmnet is due to FFT 1).
  • FFT = Friedman + Fortran + Tricks.

1) If you don't get the joke look up the fast Fourier transform.
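A minimal base-R sketch (not glmnet's Fortran implementation, which is far more refined) of cyclic coordinate descent for the objective \( \frac{1}{2}\|\mathbf{Y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1 \):

```r
## Cyclic coordinate descent for (1/2) ||y - X b||_2^2 + lambda ||b||_1.
## Each coordinate update is soft thresholding on a partial residual.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lasso_cd <- function(X, y, lambda, iterations = 100) {
  b <- numeric(ncol(X))
  xtx <- colSums(X^2)
  for (it in seq_len(iterations)) {
    for (j in seq_along(b)) {
      ## partial residual: leave predictor j out of the fit
      r <- y - X[, -j, drop = FALSE] %*% b[-j]
      b[j] <- soft(sum(X[, j] * r), lambda) / xtx[j]
    }
  }
  b
}
```

With orthonormal columns of \( \mathbf{X} \), a single pass already reproduces the soft thresholding solution from the earlier slide.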

Impact

From P. Bühlmann's discussion of R. Tibshirani, Regression shrinkage and selection via the lasso: a retrospective, JRSSB, 2011.

Prediction of tumor site

Multinomial lasso

The squared error loss is replaced by the negative log-likelihood:

\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \ell(\beta) + \lambda \|\beta\|_1. \]

With \( K \) groups and \( p \) predictors there will be \( Kp \) parameters.

In the example we have \( K = 9 \) and \( p = 384 \) giving 3456 parameters.

Multinomial lasso with glmnet

load("miRNA.RData")
miRNAlasso <- glmnet(Xprim, Yprim, family = "multinomial")
plot(miRNAlasso)

plot of chunk unnamed-chunk-1

Cross-validation with glmnet

miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial", 
                        type.measure = "class")
plot(miRNAlasso, ylim = c(0, 0.4))

plot of chunk miRNAcv

Group lasso

Ordinary lasso penalty: \[ \|\beta\|_1 = \sum_{k,j} |\beta_{kj}| \]

Group lasso penalty: \[ \sum_{k=1}^M \gamma_k \|\beta_{k\cdot}\|_2 \]

Sparse group lasso

Group lasso penalty: \[ \sum_{k=1}^M \gamma_k \|\beta_{k\cdot}\|_2 \]

Sparse group lasso penalty: \[ \alpha \|\beta\|_1 + (1-\alpha) \sum_{k=1}^M \gamma_k \|\beta_{k\cdot}\|_2 \]

Sparse group lasso

The (default) sparse group lasso penalty for multinomial lasso is defined as \[ \begin{aligned} \mathrm{Pen}(\beta) & = \alpha \sum_{k,j} |\beta_{kj}| + (1-\alpha) \sum_{k=1}^M \gamma_k \sqrt{\sum_{j=1}^p \beta_{kj}^2} \\ & = \alpha \|\beta\|_1 + (1-\alpha) \sum_{k=1}^M \gamma_k \|\beta_{k\cdot}\|_2 \end{aligned} \] for \( \alpha \in [0,1] \). The value of \( \alpha = 1 \) gives lasso and \( \alpha = 0 \) gives group lasso.

The group lasso gives parameters associated with a predictor that are either all 0 or all different from 0. The sparse group lasso encourages a group selection, but allows for selection of only some parameters associated with a predictor.
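The penalty above can be written directly in base R for a coefficient matrix with one group per row and group weights \( \gamma_k \); this is an illustrative sketch, not msgl's implementation, and the coefficient values are made up:

```r
## Sparse group lasso penalty for a coefficient matrix B,
## with groups as rows and group weights gamma.
sgl_penalty <- function(B, alpha, gamma = rep(1, nrow(B))) {
  alpha * sum(abs(B)) +
    (1 - alpha) * sum(gamma * sqrt(rowSums(B^2)))
}

B <- rbind(c(3, 4), c(0, 0), c(1, 0))
sgl_penalty(B, alpha = 1)    # lasso penalty: 8
sgl_penalty(B, alpha = 0)    # group lasso penalty: 6
sgl_penalty(B, alpha = 0.5)  # sparse group lasso penalty: 7
```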

The msgl package (\(\alpha = 0.5\), the default)

lambda <- msgl.lambda.seq(Xprim, Yprim, lambda.min = 0.01)
miRNAmsgl <- msgl.cv(Xprim, Yprim, lambda = lambda)

plot of chunk miRNAplot

The msgl package (\(\alpha = 0\), group lasso)

lambda <- msgl.lambda.seq(Xprim, Yprim, alpha = 0, 
                          lambda.min = 0.01)
miRNAmsgl <- msgl.cv(Xprim, Yprim, lambda = lambda, alpha = 0)

plot of chunk miRNAplot2

The msgl package

The R package msgl implements sparse group lasso for multinomial models.

It supports the default grouping of all parameters associated with one predictor, and it supports user defined groups of predictors.

The glmnet package supports group lasso for multinomial models with the default grouping, but not sparse group lasso or grouping of predictors.

Multinomial group lasso with glmnet

miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial", 
                        type.measure = "class", 
                        type.multinomial = "grouped")
plot(miRNAlasso, ylim = c(0, 0.4))

plot of chunk miRNAcvmult

Professional R users

A professional cowboy recognizes the layman by his usage of the word lasso.

To the cowboy it is simply the rope.

My next cool statistical method has to be called Rope.

There is R in Rope, and, hopefully, the professional R users will recognize the layman by his usage of lasso - once we have Rope!

Thanks!

Packages

The packages used explicitly in the presentation:

library("glmnet")   ## Lasso via coordinate descent
library("lars")     ## Lasso via the lar algorithm
library("msgl")     ## Multinomial sparse group lasso
library("reshape2") ## For 'melt'
library("ggplot2")  ## Plotting
library("lattice")  ## Plotting
library("rgl")      ## 3d 
library("misc3d")   ## More 3d 

Other packages worth mentioning:

library("LiblineaR")  ## Fast logistic lasso etc.
library("grpreg")     ## Group lasso etc.
library("glamlasso")  ## Lasso for array models